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Abstract 

Several recent randomized linear algebra algorithms rely upon fast dimension reduction 
methods. A popular choice is the Subsampled Randomized Hadamard Transform (SRHT). In 
this article, we address the efficacy, in the Probenius and spectral norms, of an SRHT-based 
low-rank matrix approximation technique introduced by Woolfc, Liberty, Rohklin, and Tygert. 
We establish a slightly better Frobenius norm error bound than currently available, and a 
much sharper spectral norm error bound (in the presence of reasonable decay of the singular 
values) . Along the way, we produce several results on matrix operations with SRHTs (such as 
approximate matrix multiplication) that may be of independent interest. Our approach builds 
upon Tropp's in "Improved analysis of the Subsampled Randomized Hadamard Transform" . 

1 Introduction 

Numerical linear algebra algorithms are traditionally deterministic. For example, given a full 
rank matrix A G M"*^"* and a vector b G W^, Gaussian elimination - assuming that the matrix 
multiplication exponent equals 3 - requires at most 2m^/3 arithmetic operations to compute a 
vector X G R" that satisfies Ax = b, while the matrix-matrix multiplication AA""" requires at 
most (2m — l)m^ operations. Another important problem is eigenvalue computation: current 
state-of-the-art solvers compute all m eigenvalues of AA""^ in O (m^) arithmetic operations. All 
these computations are deterministic, i.e ensure that the solution of the underlying problem is 
returned after the corresponding operation count. 

Although these algorithms are numerically stable and run in polynomial time, O(m^) arith- 
metic operations can be prohibitive for many applications when the size of the matrix is large, 
e.g. on the order of millions or billions [34, 33]. One way to speed up these algorithms is to 
reduce the size of A, and then apply standard deterministic procedures to the resulting matrix. 
In more detail, for a matrix fl G i^^x** (j^j > r = o(m)), let Y = Afl G M™"^''. 47 is a so-called 
"dimension reduction" matrix and Y contains as much information of A as possible. Consider 
for example the matrix-matrix multiplication operation mentioned above. In this setting, one can 
compute Y Y^ instead of A A^ . If is chosen carefully, then 

YY'^ ^ AA^, 

and the number of operations needed to compute YY""" is at most of the order o(m^) [15, 16]. 
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Recent years have produced a large body of research on designing random matrices O with 
which many popular problems in numerical linear algebra (e.g. low rank matrix approxima- 
tion [17, 18], least-squares regression [42, 6, 11], k-mcans clustering [8]) can be solved approxi- 
mately in o{m^) arithmetic operations. We refer the reader to a recent comprehensive survey of 
the topic [27], which has now emerged as Randomized Numerical Linear Algebra. 

Some proposed choices for f2 include: (i) every entry of CI takes the values -|-1, — 1 with equal 
probability [12, 35]; (ii) the entries of ft are i.i.d. Gaussian random variables with zero mean 
and unit variance [27]; (iii) the columns of ft are chosen independently from the columns of the 
m X m identity matrix with probabilities that are proportional to the Euclidean length of the 
columns of A [21, 17]; (vi) the columns of are chosen independently from the columns of the 
m X m identity matrix uniformly at random [23]; (v) ft is designed carefully such that Afl can 
be computed in at most 0(nnz(A)) arithmetic operations, where nnz(A) denotes the number of 
non-zero entries in A [13]. 

In this article we focus on the so-called Subsampled Randomized Hadamard Transform (SRHT), 
i.e. the matrix fl contains a subset of the columns of a randomized Hadamard matrix (see Defi- 
nitions 1 and 2 below). This form of dimension reduction was introduced in [1]. It is of particular 
interest because the highly structured nature of fl can be exploited to reduce the time of com- 
puting Y = AO from O(m^r) to 0(m^ log2 r) (see Lemma 3 below). 

Definition 1 (Normalized Walsh-Hadamard Matrix). Fix an integer n = 2^ , for ip = 1,2,3,.... 
The (non-normalized) n x n rnMrix of the Hadamard- Walsh transform is defined recursively as, 

+1 +1 
+1 -1 

The nxn normalized matrix of the Walsh-Hadamard transform is equal to H = n^^H^ 

Definition 2 (Subsampled Randomized Hadamard Transform (SRHT) matrix). Fix integers r 
and n = 2P with r < n and p = 1,2, 3, .... An SRHT matrix is an r x n matrix of the form 

= J- RHD; 

• D G M"^"- is a random diagonal matrix whose entries are independent random signs, i.e. 
random variables uniformly distributed on {±1}. 

• H G M**^" is a normalized Walsh-Hadamard matrix. 

• R G W^^ is a subset or r rows from the nxn identity matrix, where the rows are chosen 
uniformly at random and without replacement. 

Lemma 3 (Fast Matrix- Vector Multiplication, Theorem 2.1 in [2]). Given x G M" and r < n, 

one can construct G W^"^ and compute 0x in at most 2nlog2(r -|- 1)) operations. 

The purpose of this article is to analyze the theoretical performance of an SRHT-based ran- 
domized low-rank approximation algorithm introduced in [45] and analyzed in [45, 27, 37]. Our 
analysis (see Theorem 4) provides sharper approximation bounds than those in [45, 27, 37]. 

Our study should also be viewed as followup to the work of Drineas et al. [19] and [40, 3] 
on designing fast approximation algorithms for solving least-squares regression problems. One 
of the two algorithms presented in [19] employs the SRHT to quickly reduce the dimension of 
the least squares problem and then solves the smaller problem with a direct least-squares solver, 
while [40, 3] use the SRHT to design a good prcconditioner for an iterative method, which is then 
used to solve the regression problem. The results in this article along with the work in [43] have 
implications in all these studies [40, 19, 3]. We discuss these implications in Section 3. 
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Beyond the SRHT. Finally, notice that the SRHT is defined only when the matrix dimen- 
sion is a power of two. An alternative option is to use other structured orthonormal random- 
ized transforms such as the discrete cosine transform (DCT) or the discrete Hartley transform 
(DHT) [45, 37, 40, 3], whose entries are on the order of n~^/^. All these transforms do not place 
any restrictions on the size of the matrix. The results of this paper - with minimal effort - can be 
extended unchanged to encompass these transforms. To sec this, notice that Lemma 3.3 in [43] 
remains unchanged for all these orthogonal transforms. Thus Lemma 6 in our work as well as all 
other results presented in this article are true for these orthogonal transforms as well. 

Roadmap. This article is structured as follows. Section 1.1 introduces the notation. In Sec- 
tion 2, we present our main results on the quality of SRHT low-rank approximations and compare 
them to prior results in the literature. In Section 3, we discuss two approaches to least-squares 
regression involving SRHT dimensionality-reduction. Section 4 first recalls known facts on the 
application of SRHTs to orthogonal matrices and then presents new results on the application of 
SRHTs to general matrices and the approximation of matrix multiplication using SRHTs under 
the Frobenius norm. Section 5 contains the proofs of our two main theorems presented in Sec- 
tions 2 and 3. We conclude the paper with an experimental evaluation of the SRHT low-rank 
approximation algorithm in Section 6. 

1.1 Preliminciries 

Wc use A, B, ... to denote real matrices and a, b, ... to denote real column vectors. I„ is the nxn 
identity matrix; Omxn is the mxn matrix of zeros; ej is the standard basis (whose dimensionality 
will be clear from the context). A(j) denotes the ith row of A; A^-'^ denotes the jth column of A; 
Aij denotes the (i, j)th element of A. We use the Frobenius and the spectral norm of a matrix: 

II-'^IIf = y^Sij -A-fj and ||A||2 = maxx:||x||2=i ||Ax||2, respectively. The notation ||A||^ indicates 
that an expression holds for both ^ = 2 and ^ = F. 

A (compact) Singular Value Decomposition (SVD) of the matrix A G i^^^x" with rank(A) = p 
is a decomposition of the form 

where the singular values of A are ordered ai > . . .a^ > Ufc+i > . . . > CTp > 0. Here A; is a 

parameter in the interval 1 < k < p and the above formula corresponds to a partition of the 
SVD in block form using k. Wc denote the ith. singular value of A by ai (A) and sometimes 
refer to ai as cJmax and dp as (Tmm- The matrices Ufc G ^mxk Up_fc G M"*^^''"'^') contain the 
left singular vectors of A; similarly, the matrices G M"^*' and Vp_fc G M"'^(''~*^) contain the 
right singular vectors of A. We denote A^ = U,fcSjfcVj G R™"^"". Aj. minimizes || A — X||g over 
all m X n matrices X of rank at most k. A^ = VaS^^UJ G R^^™- denotes the Moorc-Penrose 
pseudo-inverse of A G M™'''". Let X G M"'''" (n > m) and B = XX^ G M™^™; any matrix that 
can be written in this form is called a symmetric positive semidefinite (SPSD) matrix. For all 
1 = 1, ...,m, Xi (B) = af (X) denotes the ith eigenvalue of B. We sometimes use Amin (B) and 
Amax (B) to denote the smallest (nonzero) and largest eigenvalues of B, respectively. 
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2 Low-rank matrix approximation using SRHTs 

Using an SRHT matrix (see Definition 2), one can quickly construct a low-rank approximation 
to a given matrix A. Our main result, Theorem 4 below, provides theoretical guarantees on the 
spectral and Probenius norm accuracy of these approximations. 

Theorem 4. Let A G W^^^ with rank p and n is a power of 2. Fix an integer k satisfying 
2 < k < p. Let < e < 1/3 be an accuracy parameter, < S < 1 be a failure probability, and 
C > 1 be any specified constant. Let 

Y = A0T, 

where & G W^^ is an SRHT with r satisfying r < m and 



2^-1 



+ ^/8ln{n/S)] hi{k/d) < r < n. 



(1) 



Furthermore, let Q G j^^x** satisfy Q^Q = Ir and be such that the column space ofY is contained 
in the range o/Q (e.g. such a Q can be computed with the QR factorization in 0{mr'^) arithmetic 
operations), and let A^ G W^^'^ be given by 



Afc = QX, 



where 



^opt — 



opt J 



argmin ||Q A — X||f. 
xeM'-x", rank(x)<ifc 



Given this setup, with probability at least 1 — S^^ in(A;/(5)/4 _ following Frobenius norm 

bounds hold simultaneously: 



||A - YY^AIIf < (1 + 22e) ■ || A - AfcUp, 
||A - AfellF < (1 + 22e) ■ II A - AjtUp, 
|Afe - YYtAllF < (1 + 22e) ■ || A - Ak\\^, 
||Afe - AfcllF < (2 + 22£) • II A - AjfcllF. 



(i) 

(ii) 
(iii) 

(iv) 



Similarly, the same setup ensures that with probability at least 1 — 55, the following spectral 
norm bounds hold simultaneously: 



||A- YY1'A||2 < 1^4 + 
||A- Afe||2 <(^ + 
|Ajfc-YYtA||2 < I 4 + 



< 7 + 



?>\n{n/5)\u{p/5) 



&\n{n/d)\n{p/d) 



31n(n/5)ln(p/5) 



121n(n/(5) \n{p/5)\ 



|A- Ajt||2 + 
|A- Ajt||2 + 
|A- Ajt||2 + 
||A- Afe||2 + 



31n(p/<5) 
r 

61n(p/(5) 
r 

3 1n(p/(5) 
r 

61n(p/(5) 
r 



|A- A 



fe F, 



|A- A 



fellF, 



|A- A 



lA- A, 



fellF- 



(v) 
(vi) 
(vii) 
(viii) 



The matrix Y can be constructed in at most 2mnlog2(r + 1) arithmetic operations, the matrix 
Y(Y^^A) in O (mnr) arithmetic operations, and the matrix A^ in 0(mnr) arithmetic operations 
as well. 
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We prove this theorem in Section 5.2. Notice that the Theorem provides residual and forward 
error bounds for two low rank matrices in the spectral and Frobcnius norms. The matrix YY^A 
has rank at most r > k, while the matrix has rank at most k. Prior works have provided only 
residual error bounds [45, 27, 37]. 

The first two Probenius norm bounds in this theorem (residual error analysis) are slightly 
stronger than the best bounds appearing in prior efforts [37]. The spectral norm bounds on the 
residual error are significantly better than the bounds presented in prior work and shed light on 
an open question mentioned in [37]. We do not, however, claim that the error bounds provided 
are the tightest possible. Certainly the specific constants (22, 6, etc.) in the error estimates are 
not optimized. 

We now present a detailed comparison of the guarantees given in Theorem 4 with those 
available in the existing literature. 



2.1 Detailed Compcirison to Prior Work 

Halko et al. [27] . To put our result into perspective, we compare it to prior efforts at analyzing 
the SRHT algorithm introduced above. Halko et al. [27] argue that if r satisfies 

4 Vk + ^8 ln{kn) ^ ln{k) < r < n, (2) 

then, for both C = 2, F, 

II A - YYtAII^ < (l + ^7^) • ||A - Afell^, 

with probability at least 1 — 0(1/A;). Our first Probenius norm bound is always tighter than the 
Frobcnius norm bound given here. To compare the spectral norm bounds, note that our first 
spectral norm bound is on the order of 



If the singular values of A are flat and A has close to full rank, then the spectral norm result 
in [27] is perhaps optimal. But in the cases where it makes most sense to ask for low-rank 
approximations — viz., A is rank-deficient or the singular values of A decay fast — the spectral 
error norm bound in Theorem 4 is more useful. Specifically, if 



/ 71 

then when r is chosen according to Theorem 4 the quantity in Eqn. (3) is much smaller than 
\/7n/r ■ \\A - Akh- 

We were able to obtain this improved bound by using the results in Section 4.1, which allow 
one to take into account decays in the spectrum of A. Finally, notice that our theorem makes 
explicit the intuition that the probability of failure can be driven to zero independently of the 
target rank k by increasing the number of samples r. 
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Nguyen et al. [37] . A tighter analysis of the Probenius norm error term of the SRHT low- rank 
matrix approximation algorithm appeared in Nguyen et al. [37]. Let 6 he a probability parameter 
with < 6 < 1 and e be an accuracy parameter with < £ < 1. Then, Nguyen et al. show that 
in order to get a lank-k matrix satisfying 

||A- AfellF <(! + £)• ||A- AjtllF 

and 

||A - Akh < (2 + v^W^) • ||A - Afc||2 
with probability of success at least 1 — 5(5, one requires 

r = ft (^s"^ max{fe, ■\/khi{2n/5)} ■ max{ln /c, ln(3/(5)}^ . 

Theorem 4 gives a tighter spectral norm error bound in the cases most of interest, where ||A — 



Afellr ^ Y in(p/(5) ■ 11-^ ~ Afc||2. It also provides an equivalent Frobenius norm error bound with a 
comparable failure probability for a smaller number of samples. Specifically, if 

r > 528e-^[Vk + ^/W^8n/6)fln{8k/6) = Q, (e"^ max{A:, ln(n/5)} • max{ln A;, ln(l/5)}) , 

then the second Frobenius norm bound in Theorem 4 ensures || A — AjtUp < (1 + e) • || A — AjtUp, 
with probability at least 1 — 86. 

In the conclusion of [37], the authors left as a subject for future research the explanation of 
a curious experimental phenomenon: when the singular values decay according to power laws, 
the SRHT low-rank approximation algorithm empirically achieves relative-error spectral norm 
approximations. Our spectral norm result provides an explanation of this phenomenon: when 
the singular values of A decay fast enough, as in power law decay, one has ||A — A^Hf = 6 (1) • 
||A — Afe||2. In this case, by choosing r 



24£-^ 



Vk + ^y8ln{n/6) ln{k/6) ln{n/S) <r<n 



our second spectral norm bound ensures ||A — Afc||2 < O (1) • ||A — Ak\\2 with probabihty of at 
least 1 — 86, thus predicting the observed empirical behavior of the algorithm. 



The subsampled randomized Fourier transform (SRFT). The algorithm in Section 5.2 
of [45], which was the first to use the idea of employing subsampled randomized orthogonal 
transforms to compute low rank approximations to matrices, provides a spectral norm error 

bound but replaces the SRHT with an SRFT, i.e. the matrix H of Definition 2 is replaced by 
a matrix where the {j,h)th entry is Hjh = e^'^^^^^^^'^^^^^^^^ , where i = i.e. H is the 

unnormalized discrete Fourier transform. Woolfe et al. [45] (see eqn. 190) argue that, for any 
a > 1, /3 > 1, if 

r>a^P{a-iy^ {2kf, 
then with probability at least 1 — S/P {u = max{m, n}), 

||A-UfcEfcVfc||2 < 2 (V2a -1 + 1) • {Vau + 1 + ^) ■ ||A- Afc||2. 

Here, XJk e R™^^ contains orthonormal columns, as does Vfc G R"^'^, while G R'^'^'^ is 
diagonal with nonegative entries. These matrices can be computed deterministically from A0^ 
in O (A;^(m + n) + kr^ Inr) time. Also, computing Y = A©""" takes O (mn In r) time. 
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Two alternative dimensionality-reduction algorithms. Instead of using an SRHT matrix, 
one can take in Theorem 4 to be a matrix of i.i.d standard Gaussian random variables. 
One gains theoretically and often empirically better worse-case trade-offs between the number 
of samples taken, the failure probability, and the error guarantees. The SRHT algorithm is still 
faster, though, since matrix multiplications with Gaussian matrices require 0{mnr) time. One 
can also take 0"*" to be a matrix of i.i.d. random signs (±1 with equal probability). In many ways, 
this is analogous to the Gaussian algorithm — in both cases is a matrix of i.i.d. subgaussian 
random variables — so we expect this algorithm to have the same advantages and disadvantages 
relative to the SRHT algorithm. We now compare the best available performance bounds for 
these schemes to our SRHT performance bounds. 
We use the notion of the stable rank of a matrix, 



sr(A) = ||A||| 



2 /II A ||2 
'■112) 



to capture the decay of the spectrum of A (spectrum here refers to the singular values of A) . As 
can be seen by considering a matrix with a flat spectrum, in general the stable rank is no smaller 
than the rank; the smaller the stable rank, the more pronounced the decay in the spectrum of A. 
When r > k + 4:, Theorem 10.7 and Corollary 10.9 in [27] imply that, when using Gaussian 

sampling, with probability at least 1 — 2 • 32 — e 2 , 



I A - YYtAllp < 1 + 32 ^ ^^ + ^^ . II A - A,||p 
' " I Vr-k + l " 



and with probability at least 1 — 3e 



I A - YVAlb < 1 + + -1^ . II A - A.|b + . ||A - A 



Comparing to the guarantees of Theorem 4 we see that these bounds suggest that with the 

same number of samples, Gaussian low-rank approximations outperform SRHT low-rank ap- 
proximations. In particular, the spectral norm bound guarantees that if sr (A — A/.) < k, i.e. 
II A — Afcllp < \/A;||A — Afe||2, then the Gaussian low-rank approximation algorithm requires 
0{k/e'^) samples to return a (17-l-e) constant factor spectral norm error approximation with high 
probability. Similarly, the Frobenius norm bound guarantees that the same number of samples 
returns a (1 + 32e) constant factor Frobenius norm error approximation with high probability. 
Neither the spectral nor Frobenius bounds given in Theorem 4 for SRHT low-rank approximations 
apply for this few samples. 

[35] does not consider the Frobenius norm error of the random sign low-rank approximation 
algorithm, but Remark 4 in [35] shows that when r = 0(A;/e^ ln(l/(5)), for 1 < S < 0, and 
sr (A — Afc) < k, this algorithm ensures that with high probability of at least 1 — S, 

||A - YYtA||2 < (1 + £)||A - Afe||2. 

To compare our results to those stated in [27, 35] we assume that k S> ln(n/(5) so that r > A; In A; 
suffices for Theorem 4 to apply. Then, in order to acquire a (A + e) relative error bound from 
Theorem 4, it suffices that (here C is an explicit constant no larger than 6) 

r > C'e-'^k \n{p/5) and sr (A - A^) < C'k. 
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We see that the Gaussian and random sign approximation algorithms return (17 + e) and 

(1 + e) relative spectral error approximations, respectively, when r is on the order of k and the 
relatively weak spectral decay condition sr (A — A^) < A; is satisfied, while our bounds for the 
SRHT algorithm require r > kln{p/6) and the spectral decay condition 

sr (A - Afc) < C'k 

to ensure a (6 + e) relative spectral error approximation. We note that the SRHT algorithm can 
be used to obtain relative spectral error approximations of matrices with arbitrary stable rank at 
the cost of increasing r (the same is of course true for the Gaussian and random sign algorithms). 

The disparity in the bounds for these three schemes — the presence of the logarithmic factors 
in the SRHT bounds and the fact that these bounds apply only when r > k\n{p/S) — may reflect 
a fundamental trade-off between the structure and randomness of . The highly structured 
nature of SRHT matrices makes it possible to calculate Y much faster than when Gaussian or 
random sign sampling matrices are used, but this moves us away from the very nice isotropic 
randomness present in the Gaussian ©""^ and the similarly nice properties of a matrix of i.i.d 
subgaussian random variables, thus resulting in slacker bounds which require more samples. 

3 Least squares regression 

We now show how one can use the SRHT to solve least squares problems of the form 

min II Ax — b||2. 

X 

Here A is an m x n matrix with m S> n and rank(A) = n, b G W", and x G M". One approach 
to solve this optimization problem is via the SVD of A: Xopt = A^b; while an example of an 
iterative algorithm is the LSQR algorithm in [38]. 

During the last decade, researchers have developed several randomized algorithms that (ap- 
proximately) solve the regression problem in less running time than the approaches mentioned 
above [42, 40, 35, 3, 19]. We refer the reader to Section 3.3 in [4] for a survey of these methods. 
The fastest non- iterative method is in [19] while the fastest iterative algorithm is in [40, 3]. Both 
approaches employ the Subsampled Randomized Hadamard Transform. 

3.1 Least-squeires via the SVD 

The idea in the SRHT algorithm of Drineas et al. [19] is to reduce the dimensions of A and b by 
pre- multiplication with an SRHT matrix G R''^"* (the matrix R in this case is constructed by 
uniform sampling without replacement) and then solve quickly the smaller problem, 

min ||0Ax — 0b||2. 

X 

Let Xopt = (0A)^0b; then, assuming r satisfies (e > is an accuracy parameter) 

r = max{48^n ln(40mn) ln(10^nln(40mn)), 40nln(40mn)/£}, 
[19] shows that with probability at least 0.8, 

II Axopt - b||2 < (1 + e) • II Axopt - b||2. 
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Furthermore, assume that there exists a 7 G (0,1] such that ||UAUjb||2 = 7||b||2. Then, with 
the same probabiUty, 



The running time of this approximation algorithm is O (mn log2r' + rn^), since the SRHT mul- 
tiphcation takes O {mn log2 r) time and the solution of the small regression problem another 
O (rn2). 

Below, we provide a novel analysis of this SRHT least squares algorithm which shows that one 
needs asymptotically fewer samples r. This immediately implies an improvement on the running 
time of the algorithm. Additionally, we show logarithmic dependence on the failure probability. 

Theorem 5. Let A G ]K»"X" (m ^ n) have rank p = n and n be a power of 2; let b € R™. Let 
< e < 1/3 denote an accuracy parameter, Q < 5 < I he a failure probability, and C > 1 be a 
constant. Let & be an r x m SRHT matrix with r satisfying 



-1 2 

\/n + V 8 ln(m/(5) ln(n/(5) <r <m. 



Then, with probability at least 1 - (^C^ in(n/5)/4 _ 

II Axopt - b||2 < (1 + 22£) • II Axopt - b||2. 

Furthermore, assume that there exists 076 (0,1] such that ||UAUjb||2 = 7||b||2. Then, with 
the same probability, 

1 

||Xopt - Xopt II2 < (^"~^~^ (^k(A)a/7-2 - 1^ ||Xopt||2. 

We prove this theorem in Section 5.3. Another possibility to obtain a better analysis of the 
method of Drineas et al. is to use Lemma 10 in this article, which was proved in [29] and presents 
bounds for sampling without replacement. This analysis is not straightforward and is beyond the 
scope of this paper. 



3.2 Iterative methods 

The key idea of an iterative algorithm such as the LSQR method of [38] is preconditioning. 
Blendenpik in [3] constructs such a preconditioner by using the SRHT (the matrix R in this 
case is constructed by uniform sampling without replacement) as follows. First, an SRHT matrix 
€ W^^"^ is constructed. Then, one forms a QR factorization 0A = QRa, with Q G M*"^" and 
Ra G M"^". Finally, A and Ra are given as inputs to LSQR to find a solution to the least squares 
problem. We refer the reader to [3] (see also [29]) for a detailed discussion of this approach. The 
purpose of our discussion here is to comment on the first step of the above procedure and show 
that a preconditioner of the same quality can be constructed with a smaller r. Avron et al. [3] 
argue that if the number of samples is sufficiently large then the two-norm condition number of 
AR~^ is small. A small condition number is desirable because the number of iterations required 
for convergence of the LSQR method is proportional to the condition number. More specifically, 
Theorem 3.2 in [3] argues that with r = (nln(m) ln(nln(m))), and with constant probability 
(e.g. 0.9), 
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The analysis of Blendenpik was recently improved in [29]. More specifically, Corollary 3.11 in [29], 
along with Lemma 7 in our manuscript, which gives a bound on the coherence, show that if 



+ ^/8ln{m/6)\ ln{2n/6) <r< 



then, with probability at least 1 — 2S, 



.(AR-) < 



l + e 
e' 



We now provide a similar bound in the case where the SRHT is constructed via sampling without 
replacement. This bound is a simple combination of results in prior work. More specifically. 
Theorem 1 in [40] argues that the two-norm condition number of AR^^ equals the two-norm 
condition number of U""^©^, where U G j^mxn QQ^^ains the top n left singular vectors of A. 
Combine this fact with the bounds on the singular values of U"'^©"'^ from Lemma 6, to obtain the 
following observation. 

Remark. Let A G j^mx" ^ j^) have rank p = n and n be a power of 2. Fix < S < 1 
and < e < 1/3. Construct the upper triangular matrix Ra G M"^" via the QR factorization 
©A = QRa, where © is an r x m SRHT matrix with r satisfying 



6£ 



-2 



+ ^8 ln(m/(5)l ln{2n/6) < r < m. 



Then, with probability at least 1 — 26, 



1 



Finally, notice that we form the SRHT by uniform sampling without replacement while Blenden- 
pik samples the columns of the randomized Hadamard matrix with replacement. A different 
sampling scheme - Bernoulli sampling - was analyzed in Theorem 6.1 in [24] and Section 4 in [29]. 



The subsampled randomized Fourier transform (SRFT). Finally, wc mention the work of 
Rokhlin and Tygert [40] , which was the first to use the idea of employing subsampled randomized 
orthogonal transforms to precondition iterative solvers for least squares regression problems. [40] 
replaces the SRHT with the SRFT; notice though that one still needs O (mn In r) time to compute 
the product ©A. In this case, for any a>l,0<(5<l, if 

then with probability at least I — S, 

K (AR^^) < a. 



4 Matrix Computations with SRHT matrices 
4.1 SRHTs applied to orthonormal matrices 

An important ingredient in analyzing the low-rank approximation algorithm of Theorem 4 is 
understanding how an SRHT changes the spectrum of a matrix after postmultiplication: given a 
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matrix X and an SRHT matrix 0, how are the singular values of X and X0^ related? To be 
more precise, Lemma 22 in Section 5.1.2 suggests that one path towards establishing the efficacy of 
SRHT-based low-rank approximations lies in understanding how the SRHT perturbs the singular 
values of orthonormal matrices. To see this, we informally repeat the statement of the lemma 
here. Let A G M"^^" have rank p. Fix k satisfying < < p. Given a matrix f2 G M"^'', with 
r > k, construct Y = Aft. If V^fi has full row rank, then, for ^ = 2, F, 

II A - YYtAlll < II A - Afcllf + ||S,_fcVj_,n(vTn)^|||. (4) 

Now take CI = &^ and observe that if the product Sp_fcVj_^.0"'" (V^©"'")^ has small norm, then 
the residual error of the approximant YY^ A is small. The norm of this product is small when the 
norms of the perturbed orthonormal matrices and (V^0^)^ are in turn small, because 

\\^,~kVU®^{VI@^y\\l < I|l^p-.|||||VJ_,0^|||||K0^)^|||. (5) 

These perturbed orthogonal matrices have small norm precisely when their singular values are 
close to those of the original orthogonal matrices. 



4.1.1 SRHTs by uniform sampling without replacement 

In this section, we collect known results on how the singular values of a matrix with orthonormal 
rows are affected by postmultiplication by an SRHT matrix. 

It has recently been shown by Tropp [43] that, if the SRHT matrix is of sufficiently large 
dimensions, post-multiplying a short-fat matrix with orthonormal rows with an SRHT matrix 
preserves the singular values of the orthonormal matrix, with high probability, up to a small 
multiplicative factor. The following lemma is essentially a restatement of Theorem 3.1 in [43], 
but we include a full proof (later in this subsection) for completeness. 

Lemma 6 (The SRHT preserves geometry). Let V G M"^'^ have orthonormal columns and n he 
a power of 2. Let < e < 1/3 and Q < 8 < 1. Construct an SRHT matrix G W"^'^ with r 
satisfying 

6e"^ \Vk + A/81n(n/(5)j \n{k/6) < r < n. (6) 
Then, with probability at least 1 — 36, for all i = 1, ...,k, 

and 

Tropp [43] (see also [1]) argues that the above lemma follows from a more fundamental fact: 

if V has orthonormal columns, then the rows of the product HDV all have roughly the same 
norm. That is, premultiplication by HD equalizes the row norms of an orthonormal matrix. 

Lemma 7 (Row norms. Lemma 3.3 in [43]). Let V G M"^^ have orthonormal columns (n is a 
power of 2), H G M"^" he a normalized Hadamard matrix, D G M"^" 6e a diagonal matrix of 
independent random signs, and < (5 < 1 he a failure probability. Recall that (HDV) denotes 
the ith row of the matrix HDV G M"^'^. Then, with probability at least 1 — 5, 



„ k /81n(n/5) 

maxi=i,...,„ II (HDV)(,) h < \l - + ■ ' ^ ' ' 



n V n 
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To prove Lemma 6 we need one more result on uniform random sampling (without replace- 
ment) of rows from tall-thin matrices with orthonormal columns. 

Lemma 8 (Uniform Sampling without replacement from an Orthonormal Matrix, Corollary to 
Lemma 3.4 of [43] ). Let W G W^^^ have orthonormal columns. Let < e < 1 and < (5 < 1. 
Let M := n- maxj=i^...^„ l|W(j)||2. Let r be an integer such that 

6e-^Mln{k/S) <r <n. (7) 

Let R G M**^" be a matrix which consists of a subset of r rows from In where the rows are chosen 
uniformly at random and without replacement. Then, with probability at least 1 — 26, for i G [A;] : 

■ Vl^e < aiCRW) < Vl^e ■ J -. 
n y n 

Proof. Apply Lemma 3.4 from [43] with the following choice of parameters: i = aMln{k/S), 
a = and Stropp = rj = e. Here, I, a, M, k, rj are the variables of Lemma 3.4 from [43] (we 
also use M and k), and Stropp plays the role of S, an error parameter, of Lemma 3.4 from [43]. 
The variables e and 5 are from our Lemma. The choice of i proportional to ln{k/S) rather than 
proportional to ln(/c), as in the original statement of Lemma 3.4, is what results in a probability 
proportional to 6 instead of k; this can easily be seen by tracing the modified choice of £ through 
the proof of Lemma 3.4. ■ 

Proof, (of Lemma 6) To obtain the bounds on the singular values, we combine Lemmas 7 and 8. 
More specifically, apply Lemma 8 with W = HDV and use the bound for M from Lemma 7. 
Then, the bound on r in Eqn. (7), the bound on the singular values in Lemma 8, and the union 
bound, establish that with probability at least 1 — 36, 



V n V 



Now, multiply this inequality with y/njr and recall the definition = • RHD to obtain 

Vl - e < 0-^(0 V) < Vl + e. 

Replacing e with -y/e and using the bound on r in Eqn. (6) concludes the proof. 

The second bound in the lemma follows from the first bound after a simple algebraic manipu- 
lation. Let X = V'^0T G M'^'^'" with SVD X = UxSxV^. Here, Ux G ^^""^ , Sx G M'^^^, and 
Vx G M'■^^ since r > k. Consider taking the SVDs of (V'^©'^)"^ and (V'^0'^)'^, 

||(vT0T)t_(vT0T)T||2 = ||VxSx'U^-VxSxU^||2 = ||Vx(Sx'-Sx)U^||2 = ||Sx'-Sx||2, 

since Vx and Ux can be dropped without changing the spectral norm. Let Y = — Ex G 
j^fcxfc Then, for alH = 1, . . . , A;, Yjj = ^"^.^^ ■ We conclude the proof as follows, 



lYIh = maxi<i<jfc lYiil = maxi<i<fc ^ 



cTi(X) 



l-a2(X) J-E 
maxi<j<fe -i — — < < 1.54^/£. 



12 



4.1.2 SRHTs by uniform sampling with replacement 



Notice that Lemma 6 and Lemma 8 analyze uniform random samphng without replacement. Be- 
low, we present the analogs of these two lemmas for uniform random sampling with replacement. 

Lemma 9 is essentially a restatement of Algorithm 2 (with the probabilities set to 1/m) along 
with the third point in Remark 3.9 and Lemma 2.1 (with a = ■\/riJr) in [29]. 

Lemma 9 (Uniform Sampling with replacement from an Orthonormal Matrix [29] ). Let W € 
]^nxfe /jfj^g orthonormal columns. Let < e < 1 and < (5 < 1. Let M := n • maxj=i_...^„ 
Let r be an integer such that 

^e-^Mln{k/5) <r <n. (8) 

Let R € R*"^" be a matrix which consists of a subset of r rows from I„ where the rows are chosen 
uniformly at random and with replacement. Then, with probability of at least 1 — 25, for i G [k] : 



e. 



Lemma 10 (The SRHT preserves geometry). Let V G M"^^ have orthonormal columns and n 
be a power of 2. Let < e < 1 and Q < 5 < 1. Construct an SRHT matrix € W^"^ (R is 
constructed as in Lemma 9, i.e. via uniform random sampling with replacement) with r satisfying 
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Vk + v/81n(n/(5)l ^ \n{k/S) < r < n. (9) 



Then, with probability at least 1 — 36, for all i = 1, ...,k, 

^1-x/i < ai (V^e^) < ^/lW^ 

Proof. To obtain the bounds on the singular values, we combine Lemmas 7 and 9. More specif- 
ically, apply Lemma 9 with W = HDV and use the bound for M from Lemma 7. Then, the 
bound on r in Eqn. (7), the bound on the singular values in Lemma 9, and the union bound, 
establish that with probability of at least 1 — 3(5, 

Vl^ < (Ti ■ RHDvj < Vl^e. 

Replacing e with y/e and using the bound on r in Eqn. (7) concludes the proof. 



4.2 SRHTs applied to general matrices 

The structural result in Lemma 22, Lemma 6 on the pcrturbative effects of SRHTs on the singular 
values of orthonormal matrices, and the basic estimate in (5) are enough to reproduce the results 
on the approximation error of SRHT-based low-rank approximation in [27]. The main contri- 
bution of this paper is the realization that one can take advantage of the decay in the singular 
values of A encoded in to obtain sharper results. In view of the fact that 

l|Sp-.Vj_,0TK0T)^||| < \\^,-,Vl_,@m\\{Yl&^)\\l (10) 
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we should consider the behavior of the singular values of 'Sp-k\^J_j^&'^ instead of those of 
Vj_j;.0"'". Accordingly, in this section we extend the analysis of [43] to apply to the applica- 
tion of SRHTs to general matrices. 

Our main tool is a generalization of Lemma 7 that states that the maximum column norm of 
a matrix to which an SRHT has been applied is, with high probability, not much larger than the 
root mean-squared average of the column norms of the original matrix. 

Lemma 11 (SRHT equalization of column-norms). Suppose that A is a matrix with n columns 
and n is a power of 2. Let H G M"^" he a normalized Walsh- Hadamard matrix, and D G M"^" 
a diagonal matrix of independent random signs. Then for every t >0, 



max, 



II (ADHT)« ||2<^||A||f 



t 



n 



> 1 — n ■ e 



-1^1% 



Proof. Our proof of Lemma 11 is essentially that of Lemma 7 in [43], with attention paid to 
the fact that A is no longer assumed to have orthonormal columns. In particular, the following 
concentration result for Lipschitz functions of Rademachcr vectors is central to establishing the 
result. Recall that a Rademacher vector is a random vector whose entries are independent and 
take the values ±1 with equal probability. 

Lemma 12 (Concentration of convex Lipschitz functions of Rademacher random variables [Corol- 
lary 1.3 ff. in [30]] ). Suppose f is a convex function on vectors that satisfies the Lipschitz bound 

l/(x) - /(y)| < -^^||x-y||2 forall:iL,Y. 

Let £ he a Rademacher vector. For all t > 0, 

P[/(£)>E [/(£)] + Lt]<e-*'/«. 

Lemma 11 follows immediately from the observation that the norm of any one column of 
ADH^ is a convex Lipschitz function of a Rademacher vector. Consider the norm of the jih 
column of ADH"'^ as a function of e, where D = diag (e) : 

fj{e) = IIADH'^ejll = ||Adiag (e) hj||2 = ||Adiag (hj) £||2, 

where hj denotes the jth column of . Evidently fj is convex. Furthermore, 
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-/j(y)| < ||Adiag(hj)(x-y)||2 < || A||2||diag (hj) ||2||x-y||2 = ^||A||2||x 



where we used the triangle inequality and the fact that ||diag(hj) ||2 = ||hj||oo 
convex and Lipschitz with Lipschitz constant at most ||A||2/-v/n- 
We calculate 



E [/,(£)] <E[/,(£ 



Tr ( Adiag (hj) E [ee*] diag (h^)^ A^ 



1 



1/2 



y||2, 

Thus fj is 



Tr 
"l 



AA^ 



n 

A||f. 



.1/2 
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It now follows from Lemma 12 that, for all j = 1,2, ... ,n, the norm of the jth column of 
ADH""" satisfies the tail bound 



|ADH^ej||2 > 



A||f + ^||A||2 



< e"* 



Taking a union bound over all columns of ADH"^, we conclude that 



maxj=i,...,„ 



(ADHT)(^')||9 > illAllF + ^IIAI, 



< n ■ e 



As an interesting aside, we note that just as Lemma 6, which states that the SRHT essentially 
preserves the singular value of matrices with orthonormal rows and an aspect ratio oikjn, follows 
from Lemma 7, Lemma 11 implies that the SRHT essentially preserves the singular values of 
general rectangular matrices with the same aspect ratio. This can be shown using, e.g., the 
results on the effects of column sampling on the singular values of matrices from [24, Section 6]. 

The following lemma shows that even if the aspect ratio is larger than kjn, the SRHT does 
not substantially increase the spectral norm of a matrix. 

Lemma 13 (SRHT-based subsampling in the spectral norm). Let A G M"*^" have rank p and 
n be a power of 2. For some r < n, let & E W^'" be an SRHT matrix. Fix a failure probability 
0<6 <1. Then, 



\A@^\\l < 5||A||^ + 



Hp/6) 



\F + ^8ln{n/6)\\A\\2y 



> 1 - 2(5. 



To establish Lemma 13, we use the following Chernoff bound for sampling matrices without 

replacement. 

Lemma 14 (Matrix Chernoff bound, Theorem 2.2 in [43]; see also Corollary in [44]). Let X be 

a finite set of positive-semidefinite matrices with dimension k, and suppose that 

maxAmax (X) < B. 

Sample {Xi, . . . , X^} uniformly at random from X without replacement. Compute 



r ■ 



(E[Xi]). 



Then 



Amax . Xj j > (1 + i^)p^ 



<k- 



' /imax/-B 



,1+1/ 



for u >0. 



Proof of Lemma 13. Write the SVD of A as USV^' where S G W^f and observe that the 
spectral norm of A©"^ is the same? as that of SV^©"^. 

We control the norm of SV^©^ by considering the maximum singular value of its Gram 
matrix. Define M = 5:v'^DH'^ and let G be the Gram matrix of MR'"' : 



G = MR^ (MR^)"^. 



Evidently 



Amax(G) = -||EVT©T||2. 



n 



(11) 
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Recall that M^-') denotes the jth column of M. If we denote the random set of r coordinates 
to which R restricts by T, then 

Thus G is a sum of r random matrices Xi , . . . , sampled without replacement from the set 
X = {M(-')(M(^)) : J = 1,2, There are two sources of randomness in G: R and the 

Rademacher random variables on the diagonal of D. 
Set 



B = 

n 



^ (||s||f + VsM^llsib)^ 



and let E be the event 

max,=i,...,„||M(^')||i<5. 
When E holds, for all j = 1,2, . . . ,n, 

Ama. (m(^)(M(^))''^) = ||M(^)||i < B, 

so G is a sum of random positive-semidefinite matrices each of whose norms is bounded by B. 
Note that whether or not E holds is determined by D, and independent of R. 

Conditioning on E, the randomness in R allows us to use the matrix Chernoff bound of 
Lemma 14 to control the maximum eigenvalue of G. We observe that 

Aimax = r • A„,a. (E [Xi]) = ^A^ax (j^.^i M^^) (M^^)) ^) = 

Take the parameter v in Lemma 14 to be 

u = A-\ ln(/o/(5) 

to obtain the relation 

P [A^a. (G) > + B\n{p/d) \E]<ip-k)- e[5-(i+-)Mi+-)]'^ 

The second inequality holds because u > 4 implies that {1 + v) ln(l + v) > • | In 5. 

We have conditioned on E, the event that the squared norms of the columns of M are all 
smaller than B. By Lemma 11, E occurs with probability at least 1 — 6. Thus, substituting the 
values of B and ^max, we find that 



Amax (G) > ^ (^5||5:||i + ^""^ (||S||f + V81n(n/,5)||S||2)'j 



< 26. 



Use equation (11) to wrap up. 

Similarly, the SRHT is unlikely to substantially increase the Probenius norm of a matrix. 
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Lemma 15 (SRHT-based subsampling in the Frobenius norm). Let A G M"*^" (n is a power of 
2) and let G W^" be an SRHT matrix for some r < n. Fix a failure probability < (5 < 1. 
Then, for any rj >0, 



|A0T|||<(l + r,)||A|||] >1- 



-6. 



Proof. Let c,- 



ADH^)j||2 denote the squared norm of the jth column of ^fnfr ■ ADH^ 



Then, since right multiphcation by R samples columns uniformly at random without replace- 
ment, 

||A0T||2 = ^||ADHTrT||2 = Yli=r^^ (12) 

where the random variables Xi are chosen randomly without replacement from the set {cj}^^^. 
There are two independent sources of randomness in this sum: the choice of summands, which is 
determined by R, and the magnitudes of the {cj}, which is determined by D. 

To bound this sum, we first condition on D being such that each Cj is bounded by a quantity 
B. Call this event then 

p Li ^ (1 + ELi E [^d] < p [ELi < (1 + ^) E Li E \n \e-\ . 

To select we observe that Lemma 11 implies that with probability 1 — 5, the entries of D are 
such that 



maxj Cj < 



n 1 



r n 



|A||f + V81n(n/5)||A||2)2 < -(1 + ^S\u{n/ 5)f\\A\\l. 



Accordingly, we take 



B = ^(1 + V81n(n/(5))'||A|||, 
thereby arriving at the bound 



+ 6. (13) 



After conditioning on D, we observe that the randomness remaining on the righthandside 
of Eqn. (13) is the choice of the summands Xj, which is determined by R. We address this 
randomness by applying a scalar Chernoff bound (Lemma 14 with k = 1). To do so, we need 
A*max, the expected value of the sum; this is an elementary calculation: 



E 



I Al|2 



so //max = [Xi] 

Applying Lemma 14 conditioned on we conclude that 



A0T||2 > (i + r^)||A||^ I E] < 



(1 + 77)1+". 



r/(l+V81n(n/5))2 



+ s 



for 77 > 0. ■ 

Finally, we prove a novel result on approximate matrix multiplication involving SRHT matrices. 
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Lemma 16 (SRHT for approximate matrix multiplication). Let A G W^^"^ , B G W^^'p , and n 
he a power of 2. For some r < n, let @ E W^'"^ be an SRHT matrix. Fix a failure probability 
< 6 < 1. Assume R satisfies < R < , Then, 



lAe-eB - abiip < 2(r + ^^WMm. + ^mmmmh 



> 1 - e-"''" - 24 



Remark. Recall that the stable rank sr (A) = ||A||p/||A||2 reflects the decay of the spectrum 
of the matrix A. Lemma 16 can be rewritten as a bound on the relative error of the approximation 
A0'^0B to the product AB : 

||A0'^GB - AB||f ^ ||A||f||B||f R + 2 ( ^ ^%\n{n/5)\ 



||AB||f " ||AB||f \/r \ sr(B) /' 

In this form, we sec that the relative error is controlled by the deterministic condition number 
for the matrix multiplication problem as well as the stable rank of B and the number of column 
samples r. Since the roles of A and B in this bound can be interchanged, in fact we have the 
bound 

||A0'^0B - AB||f ^ ||A||f||B||f R + 2 ( ^JS\n{n/5) \ 

||AB||f - ||AB||f ^ max(sr(B),sr(A)) ) ' 



Proof of Lemma 16 

To prove the Lemma, we first develop a generic result for approximate matrix multiplication via 
uniform sampling (without replacement) of the columns and the rows of the two matrices involved 
in the product (see Lemma 18 below). Lemma 16 is a simple instance of this generic result. We 
mention that Lemma 3.2.8 in [14] gives a similar result for approximate matrix multiplication, 
which, however gives a bound for the expected value of the error term, while our Lemma 16 gives 
a comparable bound which holds with high probability. To prove Lemma 18, we use the following 
vector Bernstein inequality for sampling without replacement in Banach spaces; this result follows 
directly from a similar inequality for sampling with replacement established by Gross in [25] . 

Lemma 17. Let V be a collection ofn vectors in a normed space with norm \-\ . Choose Vi, . . . , "V^ 
from V uniformly at random without replacement. Also choose V/, . . . , V^' from V uniformly at 
random with replacement. Let 



and set 



IfO<t< a^/B, then 



> 4rE 



IKI 



and B > 2 max |V| 

V6V 



>l^ + t 



< exp 



4^2 



Proof. We proceed by developing a bound on the moment generating function (mgf) of 



I ^ ^1 = 1 



- II. 
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This mgf is controlled by the mgf of a similar sum where the vectors are sampled with replacement. 
That is, for A > 0, 



E 



exp (a • \Y,]^^ Vi - rE [Fi] | - A^)] < E [exp (a • \Y^[^^ F/ - rE [F] 



(14) 



This follows from a classical observation due to Hoeffding [28] (see also [26] for a more modern 
exposition) that for any convex M-valued function 5, 



Specifically, take g{V) = exp (A |F — rE ["Vi]| — A/i) to obtain the asserted inequality of mgfs. 

In the proof of Theorem 12 in [25], Gross establishes that any random variable Z whose mgf 
is less than the righthand side of Eqn. (14) satisfies a tail inequality of the form 



P [Z > + i] < exp 



t 

"4? 



(15) 



when t < s'^/M, where 



>;^e[|v/-e[v/] 



i=l 



,r. To apply this result, note that for 



and M almost surely bounds |V^' — E [V{]\ for alH = 1 
alH = 1, . . . , r, 

\V' - E \Vl] I < 2 max \ V\ = B. 
Also take V{' to be an i.i.d. copy of V{ and observe that, by Jensen's inequality. 



^e[|f/-e[v/] 



i=l 



= rE 

< rE 

< 2rE 

= 4rE 



|F/-E[F/]| 

<rE [(|F/|+ |V/'|)2] 

\vlf+ \vl'\ 



The bound given in the statement of Lemma 17 follows from taking = and M = B in 
Eqn. (15). ■ 

This vector Bernstein inequality gives us a tail bound on the Frobenius error of a simple 
approximate matrix multiplication scheme based upon column and row sampling. 

Lemma 18 (Matrix Multiplication). Let X G "SJ^^"- and Y G M"^^. Fix r < n. Select uniformly 
at random and without replacement r columns from X and the corresponding rows from Y and 
multiply the selected columns and rows with y/n/r. Let X G M"*''^ and Y e contain the 

selected columns and rows, respectively. Choose 



^2>!!!y||^«||2||y- III B>—ma^\ 

r ^-^ > > \ r> r i 

i=l 



X»||2||1?,- 



(i)l|2- 



Then ifO<t< u'^/B, 



|XY-XY||f >t + a 



< exp 



4(72 
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Proof. Let V be the collection of vectorized rank-one products of columns of \fnjr ■ X and rows 
of \fnjr ■ Y. That is, take 



V={^vec(X«y-(,))} 



i=l 



Sample Vi, . . . , uniformly at random from V without replacement, and observe that E [V^] = 
^vec(XY). With this notation, the quantities ||XY — XY||f and 

||^;^^(y,-E[v,])||^, 

have the same distribution, therefore any probabilistic bound developed for the latter holds for 
the former. The conclusion of the lemma follows from applying Lemma 17 to bound the second 
quantity. 

We calculate the variance- like term in Lemma 17, 4rE [HViHl] : 

-I " 2 " 



n — ' r 

1=1 



j=i 



Now we consider the expectation 

/^ = e[||ELi(^/-e[v/]) 

In doing so, we will use the notation '^a,b,... [C] to denote the conditional expectation of a random 
variable C with respect to the random variables A^B, . . . . Recall that a Rademacher vector is 
a random vector whose entries are independent and take the values ±1 with equal probability. 
Let £ be a Rademacher vector of length r and sample V/, . . . , V^' and V/', . . . , Y^' uniformly at 
random from V with replacement. Now /it can be bounded as follows: 



= E 
< E 



wmv/'}[||i:L(^/-^/') 



< 2E 



< 



V^[5:;.iiv7iii]. 



The first inequality is Jensen's, and the following equality holds because the components of the 
sequence {Vi —Vf} are symmetric and independent. The next two manipulations are the triangle 
inequality and Jensen's inequality. This stage of the estimate is concluded by conditioning and 
using the orthogonality of the Rademacher variables. Next, the triangle inequality and the fact 
that E [llV/lll] = E [II Villi] allow us to further simplify the estimate of /x : 



/X < 2yE|^]^J^^ = 2^/rEOrt^ < ex. 
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We also calculate the quantity 

2max||V||2 = — max IIX^^^ lUliyri) II2 < B. 
The stipulated tail bound follows from applying Lemma 17 with our estimates for B, a^, and 

/X. ■ 

Lemma 16 now follows from this result on matrix multiplication. 

Proof, (of Lemma 16) Let X = ADH"*" and Y = HDB and form X and Y according to 
Lemma 18. Then, XY = AB and 

II A0^©B - AB||f = ||XY - XY||f. 

To apply Lemma 18, we first condition on the event that the SRHT equalizes the column norms 
of our matrices. Namely, we observe that, from Lemma 11, with probability at least 1 — 2(5, 

maxi ||X«||2 < ^(||A||f + ^/8ln{n/S)\\A\\2), and (16) 



maxi ||Y(^)||2 < 4r(l|E||F + ^8 ln(n/5)||B||2). 



Conditioning on these nice interactions, we choose the parameters a and B in Lemma 18. We 
first take 



2 



4. 



r 



|B||f + v'81n(n/5)||B||2)2||A||2. (17) 



Observe that because of (16), 



^,^^n (||Y||F + V8M^||YM! 2 > ,ny- ||X«||i||Y«||i 
so this choice of a satisfies the inequality stipulated in Lemma 18. Next we choose 

B = ^(||A||f + v'81n(n/(5)||A||2)(||B||F + ^S]^^)\\Bh). 
r 

Again, because of (16), B satisfies the stipulation B > ^ maxj ||XW||2||Y(j)||2. 
For simplicity, let 7 = 81n(n/(5). With these choices for cr^ and B, 

2||A|||(||B||f + V7||B||2)^ 
B (||A||f + V7||A||2)(||B||f + V7||B||2) 
^ 2||A|||(||B||f + V7||B||2)^ 
- (||A||f + V7||A||f)(||B||f + V7||B||2) 
^ 2||A||f(||B||f + V7IIBII2) 

Now, referring to Eqn. (17), identify the numerator as s/ra to see that 

^ s/ra 
^ ~ 1 + y81n(n/lj' 
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Apply Lemma 18 to see that, when Eqns. (16) hold and < Ra < a^/B, 

P[||A0'^©B- AB||f > {R+l)a] < exp ^-^^ . 

From our lower bound on a^/B, we know that the condition Ra < a^/B is satisfied when 

R<V^/il + V8Hn/S)). 

Also, we established above that Eqns. (16) hold with probability at least 1 — 26. From these two 
facts, it follows that when 0<R< ^/r/{l + ^y8ln{n/6)), 

P[||A0'^©B- AB||f > (it;+l)(7] <exp(^-^^ +26. 

The tail bound given in the statement of Lemma 16 follows from substituting our estimate of 
a. m 

5 Proofs of our main Theorems 

5.1 Preliminaries 

To prove Theorem 4 we first need some background on restricted (within a subspace) low-rank 
matrix approximations. Let A G M™^", let A; < n be an integer, and let Y G j^^x** with r > k. 
We call IIy ^(A) G M"*^" the best rank k approximation to A in the column space of Y, with 

respect to the ^ norm (^ = 2 or ^ = F). Formally, for fixed ^, we can write IIy k^-^) ~ ^X^, 
where 

X^= argmin ||A-YX|||. 

xeR'''<":rank(x)<fc 

In order to compute (or approximate) IIy fe(-^) ^^^^ following 3-step procedure: 

1: Orthonormalize the columns of Y in 0{mr'^) time to construct the matrix Q G M"*^^. 
2: Compute Xopt = argmin^^j^^xn rank(x)<A; II Q"^-'^ ~ -^IIf 0{mnr + nr"^) time. 
3: Return QXopt G M™-><" in 0{mnr) time. 

QXopf is a matrix of rank at most k that lies within the column span of Y. Note that though 
nY;;.(A) can depend on ^, the algorithm above computes the same matrix, independent of ^. 
The following result, which appeared as Lemma 18 in [5], proves that this algorithm computes 
IIy k^-^) ^ constant factor approximation to IIy fc(-^)- 

Lemma 19. [Lemma 18 in [5j] Given A G M™'^", Y G M™^'', and an integer k < r, the matrix 
QXopt G M"*^" described above can be computed in O (mnr + (m + n)r^) time and satisfies: 

\\A-QXopt\\l = \\A-U^^,{A)\\l 
\\A-QXopt\\l < 2||A-n2 (A)||2. 
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5.1.1 Matrix Pythagoras and generalized least-squares regression 



Lemma 20 is the analog of the Pythagoras theorem in the matrix setting. A proof of this lemma 
can be found in [5]. Lemma 21 is an immediate corollary of Matrix-Pythogoras. 



Lemma 20. // X, Y G M"*""" and XY'^ = 0„xm or X^Y = O^xn, then for both C = 2, F : 

||X + Y||| < ||X||| + ||Y|||. 
^mxn^ C G M"*><^ and for all X G M"^^" and for both C = 2, F : 
||A-CC+A||| < ||A-CX|||. 



Lemma 21. Given A G 



Proof Write A - CX = (I - CC+)A + C(C+A - X). Observe that ((I - CC+)A)TC(C+A) = 
Onxn- By Lemma 20, ||A-CX|||> ||(I - CC+)A||| + ||Ci(C+A - X)||| > ||(I - CC+)A|||. ■ 

5.1.2 Low-rank matrix approximation based on projections 

The low-rank matrix approximation algorithm investigated in this paper is an instance of a wider 
class of low-rank approximation schemes wherein a matrix is projected onto a subspace spanned 
by some linear combination of its columns. The problem of providing a general framework for 
studying the error of such projection schemes is well studied [9, 27, 5]. The following result 
appeared as Lemma 7 in [9] (see also Theorem 9.1 in [27]). 

Lemma 22. [Lemma 7 in [9]] Let A G W^^'" have rank p. Fix k satisfying < k < p. Given a 
matrix ft G M"^'', with r > k, construct Y = Af2. // V^O has full row rank, then, for ^ = 2,F, 



|A- YYtAII? < ||A-n^,(A) 



^ 11^ - -^^Y,A:V^;il? ^ l|A - Afclll + \\'Ep_kV^_^ic,y v ^ 



n(vffi)^ III 



(18) 



This lemma provides an upper bound for the residual error of the low-rank matrix approxi- 
mation obtained via projections. We now prove a new result for the forward error. 



Lemma 23. Let A G 



have rank p. Fix k satisfying < k < p. Given a matrix ft G 



with r > k, construct Y = Aft. If ft has full row rank, then, for ^ = 2,F, 
IIAfe - YYtAlll < ||A - Afelll + ||I]p_fcVj_jfcn(vTn)^|||. 



(19) 



Proof For both ^ = 2, F, 



|Afc- YYtAlll 



< 

< 



< 



YYtA^_fc||| 

2 

p-k\\^ 

Y(V?0)Vt||| + ||A,_,||| 



Afc - YYt Afc 

Afe-YYtAfc||| + ||A 
Afe-Y(VT 



A, - Vk-SkV^niV^nyvj: + A,_kn{V^n)W^\\l + \\A,_k\\'l 
A,_fen(vTn)VT||| + ||A,_fc||| 

Up-ifc||i||Sp_fcVp_jfcO(V^S)^|||||V^||| -I- ||Ap_jt||| 
S^_fcV^_feJI(VTj|)t||| + ||A^_fe||| 

In the above, in the first inequality we used Lemma 20 ((A/j — YYtAfc)(— YY^Ap./j)""" = 
Omxm because AjfcAj_j, = Omxm)- In the second inequality we used Lemma 21 (with X = 
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(V^ r2)^ V^). In the third equahty, we used the fact that (V^ r2)(V^ri)"'" = I^, since, by assump- 
tion, rank(V^f2) = k. In the last inequaUty we used the sub-multiphcativity property of the 
spectral and Probenius norms, i.e for any three matrices X, Y, Z: 

||XYZ||| < ||X||i||YZ||| < ||X||i||Y|||||Z||i. 



5.1.3 Least squares regression based on projections 

Similarly, one of the two SRHT least squares regression algorithms analyzed in this article is an 
instance of a wider class of approximation algorithms where the dimensions of the input matrix 
and the vector of the regression problem are reduced via pre-multiplication with a random matrix. 
Lemma 9 in [7] provides a general framework for the analysis of such projection algorithms. 

Lemma 24. Let A G M"*^" (m> n) of rank p and b G be inputs to the least squares problem 
miUxeR" ||Ax — b||2. Let U G W^^^ contain the top p left singular vectors of A and let ft € M*"^^ 
(p < r < m) be a matrix such that rank(fi"'"U) = rank(U). Then, 

II Axopt - b||i < II Axopt - bill + II (J2Tu)^f2T (Axopt - b) ||i. 

In the above, ii.opt = A^b and li-apt = (r2^A)^i7^b. 

The following lemma is a restatement of Lemma 2, along with Eqn. (9) and Eqn. (11) in [19]. 
It gives a bound on the forward error of the approximation of a least-squares problem that is 
obtained via projections. In [19] the parameters a and /3 are fixed to a = l/\/2 and P = e/2, 
for some parameter < e < 1. Showing the result for general a > and /5 > is straightforward, 
hence a detailed proof is omitted. 

Lemma 25 (Lemma 2 in [19]). Let A € ]R™x"- (m > n) of rank p = n and b G be inputs 
to the least squares problem miUxeR" ||Ax — b||2. Let U G M™^^ contain the top p left singular 
vectors of A and let il G M"*^'' (p < r < m). For some a > 0, and P > 0, assume that 

<7min (U^n) > al (20) 

and 

llU^nfiT (Ax„pt - b) g < /3\\A^opt - h\\l (21) 

Furthermore, assume that there exists a 7 G (0, 1] such that ||UAU^b||2 > 7||b||2. Then, 

1 

||xopt - Xopt||2 < • (^At(A)V7~^ - 1) l|Xopt||2. (22) 

In the above, Xopt = A^b and Xopt = (0"^A)^ri"^b. 

5.2 Proof of Theorem 4 
5.2.1 Frobenius norm bounds 

We first prove the Probenius norm bounds in the theorem (i.e Eqns. (i), (ii), (iii), and (v)). We 
would like to apply Lemma 22 with fl = 0^ G M"^*" and ^ = F. Notice that because of our 

assumption that 

r > 6&e-^ [Vk + ^8\n{n/S)] ^ ln{k/S), 
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rTciTxt||2 



where C > 1, Lemma 6 implies that with probabihty at least 1 — 36, 

rank(V^0^) = k; 
so, for ^ = F, Lemma 22 applies with the same probability, yielding 

||A - YYtAlll < ||A - n^,fc(A)||2 < ||A - Afe||| + ||S,_,Vj_fce^ (V,^ ; up. 
We continue by bounding the second term in the right hand side of the above inequality, 
S := ||S,_,Vj_,©T(V^0T)t||2 

< 2||s,_,vJ_,0T0v,||| + 2||s,_,vJ_,0T((vTeV-(v^eT)T)||2 

< 2||5],_,Vj_,0T0V,||| + 2||S,_,Vj_,0^^1||||(V^0T)t_(vT^^^^ 



(23) 



< 8e-||E,_feVj_,||| + 2. (H||s^_,vJ_,|||) •(2.386) 



< 22e • ||5]p_fc||p. 

In the above, in the first inequality we used the fact that for any two matrices X, Y : ||X + 
Y||| < 2||X||| + 2||Y|||. To justify the first estimate in the third inequality, first notice that 
Vj_j;.Vfe = 0„xfe- Next use Lemma 16 with R = C^/hi{k/S). Prom the lower bound on r, we have 
that 

l + ^81n(n/(5) ~ l + y/8\n{n/6) 



so this choice of R satisfies the requirements of Lemma 16. Apply Lemma 16 to obtain 

r 



p-ky p-k\\F 



>l_e-i^V4_2<5. 



Use the lower bound on r to justify the estimate 

l2 



6C2e-i [^/k + v'81n(n/(5)] ln{k/S) 

2e (CVln(fc/(5) + 1)^ 
3 ' C2ln(A;/(5) 



< 



2£ 



1 + 



Cy/Hk/6) 



This estimate implies that 



2e 



|I]p_itVj_fe0T0Vfe|||<- 1 + 



p-ky p-k\\F 



C^yin{k/S) J 

Since C > 1 and > 2, a simple numerical estimation allows us to state that, more simply, 
P [||Sp-feVj_,0T0V,||| < 4£||E,_,Vj_,|||] > 1 - 5^^Mfc/5)/4 _ 
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The remaining estimates in the third inequahty fohow from applying Lemma 6 (keeping in mind 
our lower bound on r) to obtain 



ll(V^0^)^-(V^0T)T||2<2.38e 



> 1 - 3(5. 



and Lemma 15 with = 7/4 to obtain 

11 



p-fc ♦ p-k^ IIF ^ 



p—k 

We have the estimate 



,,V \/T II 2 

-^W^p-k^ p-k\\¥ 



> 1 



(1 + 7/4)1+7/4 



r/(l+^81n(n/5))2 



.7/4 



< 



(1 + 7/4)1+7/4 e' 



so in fact 



T|i2 



< 



11, 



> 1 _ (,-6C2e-i ln(fc/5) _ ^ 



> 1 - 2(5. 

Combining (23) with the bound on S", we obtain 

||A - YYtAlll < ||A - n^,fc(A)||| < (1 + 22e) • ||A - Afc|||. 

Taking the square-roots of both sides and using the fact that yjl + 22e < 1 + 22e gives the bound 

||A - YY^AIIf < ||A-n^^fc(A)||F < Vl + 22£ • ||A- AfeUp < (1 + 22^) ||A- AfcHp. 

Eqn. (i) in the theorem follows directly from this: 

||A - YY^AIIf < (1 + 22£) ||A - AfeHp. 

To derive Eqn. (ii), recall the equality || A — AfeUp = || A — IIy ;;.(A)||f, established in Lemma 19. 
From this it follows that 

||A- AfellF < (l + 22e) ||A- AfcllF 

also. 

We now prove Eqn. (iii) in the theorem. Eqn. (19) with ^ = F and Q, = @^ G M"^'' gives 
IIAfe - YYtAlll < II A - AfellF + ||5]^_fcVj_fe0T(vTeT)^||2 . 
Now recall the bound for S: 

||5]^_fcVj_feGT(Vfe^©'^yi|2 < 22e||A- Afclll. 

So, 

II Afc - YYtAlll < ||A - Afclll + 22£ • || A - Afe||| = (1 + 22^) • || A - Afe|||. 

Taking the square-roots of both sides and using the fact that ^/T+~22e < 1 + 22£ gives Eqn. (iii). 
Finally, we prove Eqn. (iv): 

IjAfe - AjfcllF = ||A - Afe - (A - Afe)||F < ||A - Ajfc||F + ||A - Ajk||F < (2 + 22e)||A - A^Hf, 

where the first inequality follows by the triangle inequality and the second by using the bound 
obtained in Eqn. (ii) in the theorem. 

The failure probability in the theorem follows from a union bound on all the probabilistic 
events involved in bounding S. 
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5.2.2 Spectral norm bounds 

We now prove the spectral norm bounds in Theorem 4 (i.e Eqns. (v), (vi), (vii), and (vhi)). 
Lemma 6 implies that, with this choice of r, 

ll(v^eVlli<(i-V^)-\ 

with probability at least 1 — 35. Consequently, Vj©"'" has full row-rank and Lemma 22 with 
ft = 0^ G M"^*" and ^ = 2 applies with the same probability, yielding 

||A - YYtAlli < ||A - Afelli + (1 - V£)-'||S^-feVj_fe©T||i (24) 

Also, the spectral norm bound in Lemma 19 implies 

||A - Afelli < 2|| A - n2._,(A)||i < 2 (||A - A^Hi + (1 - V^)-1||I],_,vJ_,0T||2) . (25) 

We now provide an upper bound for ■\/Z where Z is the scalar 

Z := ||A - Afelli + (1 - V^)-i||I]p_feVj_,eT||i. 
Prom Lemma 13 we obtain 

^ < (l + ■ " "^^"^ + (fZ^ (l|A - AfcllF + V81n(n/5)||A - Afc||2)' 

with probability at least 1 — 56. Using that e < 1/3, we see that (1 — ■s/e)^^ < 3, so 

Z < 16 • ||A - Afelli + (^||A - AfellF + V'81n(n/<5)||A - A^hY ■ 

Use the subadditivity of the square-root function and rearrange the spectral and Frobenius norm 
terms to obtain that 



r \ r 



Apply Eqn. (24) to arrive at Eqn. (v) in the theorem, 

IIA - YYtAlb < (4 + .^EmSHU] . liA - + .SI . I, A - A.II.. 



Take the square root of both sides of Eqn. (25), use the subadditivity of the square root function, 
and use the bound for y/Z to find Eqn. (vi): 



r 



We now derive the spectral norm bounds on the forward errors. Eqn. (19) with O = G 
IIAfe - YYtAlli < ||A - Afe||2 + ||X;^_fcVj_jt©T(vJ©T)^||i. 
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Use the inequality ||(Vj0^)"^||| < (1 - ^/e)-^ to obtain 

\\Ak - YYtAlli < ||A - Afclli + (1 - Vi)"'||Sp_fcVj_fc0 
Take the square-root of both sides of this inequahty to obtain 



T||2 
2- 



IIAfe - YYtA||2 < -^||A - AkWl + (1 - V^)-'l|5^P-feVj_fe0T||i 
and identify the righthand side as s/Z. Use the bound on s/Z to arrive at Eqn. (vii): 



I A. - YYtAlb < I 4 + ^ ^Hn/S)H.m \ . ||^ _ ^^||^ ^ . ||a - A„„. 

We now prove Eqn. (viii). First, recall that Afc = QXopt and observe that 

||Afc — Afc||2 = ||Afc + Ap_fc — Afe — Ap_fe||2 < ||A — Afc||2 + ||Ap_fc||2. 
Now, recall Eqn. (vi). 



I A - A.IU < I 6 + , . II A - A.|b + , ™ . IIA - A.b. 



^ r j ' V r 

In conjunction with the previous inequality, this gives us the desired bound: 



I A - A.fe < : 7 + ^'1?MVW) J . 11^ _ ^^11^ ^ ^ein^ . 11^ _ ^^11^. 

Finally, we recall that the two probabilistic events we used in our derivations — that (V^©"'")^ 
is bounded and that our application of Lemma 13 succeeds — hold with probabilities at least 1—35 
and 1 — 2(5 respectively, so the failure probability for each of these four spectral error bounds is 
no more than 55. 

5.3 Proof of Theorem 5 

To prove the first bound in the theorem (residual error analysis), we will use Lemma 24, which is 
the analog of Lemma 22 but for linear regression. Using this lemma, the proof of the first bound 
in Theorem 5 is similar to the proof of the first Frobenius norm bound of Theorem 4. 

Wc would like to apply Lemma 24 with ft = @^ G MJ^^^. For convenience, we take U = Ua- 
Notice that Lemma 6 implies that with probability at least 1 — 36, rank(0U) = p = n; so. 
Lemma 24 applies with the same probability, yielding 

\\A5copt - Hi < WA^opt - b||i + ||(0U)t0 {A^opt - b) \\l (26) 

Wc continue by bounding the second term in the right hand side of the above inequality (for 
notational convenience, let Zopt = Axopt — b), 

S := ||(0U)t0(Ax„pt-b)||2 



|2 

opt 1 1 2 

2 



< 2\\U^&^@Zopt\\l + 2||((0U)t - (0U)T)0z, 

< 2||uT0T0z,pt||2 + 2||((0U)t - ie\jf)\\l\\ezopt\\2 
= 2||zT,0T0U||i + 2||((0U)t - (0U)T)||i||zT,0 

< 8£-||zopt||i + 2-(2.38£)- (^^WzoptWi^ 

< 22s ■ llzoptlll- 



T||2 
2 
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In the above, in the first inequality we used the fact that for any two matrices X, Y : ||X+ Y||p < 
2||X|||+2||Y|||. To justify the first estimate in the third inequaUty, first notice that z^^U = Oixn 
since 

zT = {A^opt - b)^ U = (AA+b - b)^ U = (UU^b - b)^ U = Oix„. 

Next, use Lemma 16 with R = Cy/Hn/d). Recall that r > GC^e"! + y/8ln{m/d)f ln{n/S), 
where C > 1, so 

\/r I T \fn + -v/81n(m/(5) „ r — ; — — _ 

l + y/8ln{m/6) " l + ^y8ln{m/6) 



and this choice of R satisfies the requirements of Lemma 16. Apply Lemma 16 to obtain 

> 1 _ e-RV4 _ 26. 



Manipulations similar to those used in proving the first Frobenius norm bound in Theorem 4 
show that the lower bound on r implies 

F [||zJ,,0T0U||i < Ae\\zl,\\l] > 1 - - 26. 

The remaining estimates in the third inequality follow from applying Lemma 6 to obtain 



||((eu)t-(eu)T)||i<2.38e 

and Lemma 15 with = 7/4 to obtain 



11, 



,T ||2 



Tll2 < „ 
2 ^ ^ \\^opt\\2 



> 1 



=7/4 



> 1 - 3(5. 



(1 + 7/4)1+7/4 



Manipulations similar to those used in proving the first Frobenius norm bound in Theorem 4 
show that the latter bound implies 



UT CkT||2 ^ 11 ||„T ||2 

IZopt© lb < -jW^opth 



> 1 - 2(5. 



Combining (26) with the bound on S, we obtain 

\\Axopt - h\\l < (1 + 22£) • II Axopt - b||i. 

Taking the square-root to both sides and using the fact that ^/l + 22e < 1 + 22£ gives the bound 
in the theorem, 

II Axopt - b||2 < (1 + 22e) • || Axopt - b||2. 

The failure probability in the theorem follows by a union bound on all the probabilistic events 
involved in the proof. 

We now prove the forward error bound in the theorem. Towards this end, we will use 
Lemma 25 with f2 = &^ . Recall that Eqn. (22) in the lemma. 



I ^opt ^opt 1 1 2 



■ (k(A)x/7"^ - l) Ijxoptlh, 
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is satisfied if a and /3 satisfy Eqns. (20) and (21) respectively, and 7 G (0, 1] satisfies ||UU"'^b||2 > 
7||b||2. By hypothesis, such a 7 exists. We now show that appropriate a and /3 exist. 
Lemma 6 imphes that 

f7^in(UTeT) > (1- V^)5 

so a = 1 — ^/e satisfies Eqn. (20). In the proof of the residual error bound in this theorem, we 
showed that 

||UT0T0 (Axopt - b) ||2 < 4£||Axopt - bill, 

so ^ = 4£ satisfies Eqn. (21). With these choices of a and /3, Eqn. (22) in Lemma 25 gives the 
claimed forward error bound. 

6 Experiments 

In this section, we experimentally investigate the tightness of the residual and forward error 
bounds provided in Theorem 4 for the spectral and Frobenius norm approximation errors of SRHT 
low-rank approximations of the forms YY^A and A^ = QXop^. Additionally, we experimentally 
verify that the SRHT algorithm is not significantly less accurate than the Gaussian low-rank 
approximation algorithm. 

6.1 Test Matrices 

Let n = 1024 and consider the following three test matrices: 

1. Matrix A G m('^+i)x" is given by 

A = [lOOei + 62, lOOei + 63, . . . , lOOei + Sn+i], 
where Cj G M^^^ are the standard basis vectors. 

2. Matrix B G M""^" is diagonal with entries {B)ii = 100 * (1 - (i - l)/n). 

3. Matrix C G M"^** has the same singular values as B, but its singular spaces are sampled from 
the uniform measure on the set of orthogonal matrices. More precisely, C = UBV^, where 
G = USV^ is the SVD of an tt, x n matrix whose entries are standard Gaussian random 
variables. 

These three matrices exhibit properties that, judging from the bounds in Theorem 4, could 
challenge the SRHT approximation algorithm. Matrix A is approximately rank one — there is a 
large spectral gap after the first singular value — but the residual spectrum is fiat, so for > 1, 
the II A — A^IIf terms in the spectral norm bound of Theorem 4 are quite large compared to the 
II A — Afc||2 terms. Matrices B and C both have slowly decaying spectrums, so one again has a 
large Frobenius term present in the spectral norm error bound. 

B and C were chosen to have the same singular values but different singular spaces to reveal 
any effect that the structure of the singular spaces of the matrix has on the quality of SRHT 
approximations. The "coherence" of their right singular spaces provides a summary of the relevant 
difference in the singular spaces of B and C. Let <S be a /c-dimensional subspace, then its coherence 
is defined as 

/i(5) = maxPjj, 
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where P is the projection onto S; the coherence of S is always between k/n and 1 [10]. It is 
clear that all the right singular spaces of B are maximally coherent and it is known that with 
high probability the dominant right /c-dimensional singular space of C is quite incoherent, with 
coherence on the order of maxjfe, log n}/n [10]. 

To gain an intuition for the potential significance of this difference in coherence, consider 
a randomized column sampling approach to forming low-rank approximants; that is, consider 
approximating M^. with a matrix YY^M where Y comprises randomly sampled columns of M. 
It is known that such approximations are quite inaccurate unless the dominant fc-dimensional right 
singular space of M is incoherent [41, 23]. One could interpret SRHT approximation algorithms as 
consisting of a rotation of the right singular spaces of M by multiplying from the right with DH"*" 
followed by forming a column sample based approximation. The rotation lowers the coherence of 
the right singular spaces and thereby increases the probability of obtaining an accurate low-rank 
approximation. One expects that if M has highly coherent right singular spaces then the right 
singular spaces of MDH"'" will be less coherent but possibly still far from incoherent. Thus we 
compare the performance of the SRHT approximations on B, which has maximally coherent right 
singular spaces, to their performance on C, which has almost maximally incoherent right singular 
spaces. 

6.2 Empirical compcirison of the SRHT and Gaussian algorithms 

Figure 1 depicts the relative residual errors of the Gaussian and SRHT algorithms for both 
approximations addressed in Theorem 4: YY^M and QX^pi, which we shall hereafter refer to 
respectively as the non-rank-restricted and rank-restricted approximations. Here the matrix M is 
used to refer interchangeably to A, B, and C. The relative residual errors (||M-YYtM||g/||M- 
Mfc|[^ and ||M — QXop^ ||^/||1V[ — Mfc||j for ^ = 2,F) shown in this figure for each value of k were 
obtained by taking the largest of the relative residual errors observed over 10 trials of low-rank 
approximations each formed using r = [2fclnn] samples. 

With the exception of the residual spectral errors on dataset A, which range between 2 and 9 
times greater than the optimal rank-A; spectral residual error for k < 20, we see that the residual 
errors for all three datasets are less than 1.1 times the residual error of Mjt if not significantly 
smaller. Specifically, the relative residual errors of the restricted-rank approximations remain less 
than 1.1 over the entire range of k while the relative residual errors of the non-rank-restricted 
approximations actually decrease as k increases. 

By comparing the residual errors for datasets B and C, which has the same singular values as B 
but is less coherent, we see evidence that the spectral norm accuracy of the SRHT approximations 
is increased on less coherent datasets; the same is true for the Probenius norm accuracy to a lesser 
extent. The Gaussian approximations seem insensitive to the level of coherence. Only on the 
highly coherent dataset B do we see a notable decrease in the residual errors when Gaussian 
sampling is used rather than an SRHT; however, even in this case the residual errors of the 
SRHT approximations are comparable with that of Bj^. In all. Figure 1 suggests that the gain 
in computational efficiency provided by the SRHT does not come at the cost of a significant loss 
in accuracy and that taking r = [2/clnn] samples suffices to obtain approximations with small 
residual errors relative to those of the optimal rank-fe approximations. Up to the specific value 
of the constant, this latter observation coincides with the conclusion of Theorem 4. 

Figure 2 depicts the relative forward errors of the Gaussian and SRHT algorithms (||Mfe — 
YY't'Mll^/llMfell^ and \\Mk - QXoptllg/llMfcH^ for ^ = 2,F) for the non-rank-restricted and 
rank-restricted approximations. The error shown for each k is the largest relative forward error 
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Relative residual spectral erroiH for A 



IIA-YYtAlij/IIA-Ai-fc SRHT 
II A - QX„^,||2/||A - Atlli, Gaussian 
l|A-QX„„||2/||A-Ai.||2. SRHT 
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Figure 1: Relative spectral and Frobenius norm residual errors of the SRHT and Gaussian low- 
rank approximation algorithms (||M - YY"I'M||5/||M - Mfc||^ and ||M - QXopt ||^/||M - MfcH^ 
for ^ = 2, F) as a function of k for the three datasets M = A, B, C. Each point is the worst of 
the errors observed over 10 trials, r = \2k In n] column samples were used in each trial. 
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Figure 2: The relative spectral and Frobenius norm forward errors of the SRHT and Gaussian 
low-rank approximation algorithms (||Mfc — YY^M||^/||Mfc||^ and HM^ — QXop^ ||^/||Mfc||^ for 
^ = 2,F) as a function of k for the three datasets M = A, B, C. Each point is the worst of the 
errors observed over 10 trials, r = [2fclnn] column samples were used in each trial. 
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observed among 10 trials of low-rank approximations each formed using r = [2A;lnn] samples. 
We observe that the forward errors of both algorithms for both choices of sampling matrices 
are on the scale of the norm of Mfc. By looking at the relative spectral norm forward errors we 
see that in this norm, perhaps contrary to intuition, the rank-restricted approximation does not 
provide a more accurate approximation to Mjt than does the non-rank-restricted approximation. 
However the rank-restricted approximation clearly provides a more accurate approximation to 
Mfc than the non-rank-restricted approximation in the Frobenius norm. A rather unexpected 
observation is that the rank-restricted approximations are more accurate in the spectral norm for 
highly coherent matrices (B) than they are for matrices which are almost minimally coherent (C). 
Overall, Figure 2 suggests that the SRHT low-rank approximation algorithms provide accurate 
approximations to when r is in the regime suggested by Theorem 4. 

6.3 Empirical evaluation of our error bounds 

Figures 1 and 2 show that when r = [2A:lnn] samples are taken, the SRHT low-rank approx- 
imation algorithms both provide approximations to M that are within a factor of {I + e) as 
accurate in the Frobenius norm as M^, as Theorem 4 suggests should be the case. More pre- 
cisely, Theorem 4 assures us that 528e^^[^/k + y^S ln(8n/5)]^ ln{8k/5) column samples are suf- 
ficient to ensure that, with at least probability 1 — S, YY^M and QXopt have Frobenius norm 
residual and forward error within (1 -|- e) of that of M^. The factor 528 can certainly be reduced 
by optimizing the numerical constants given in Theorem 4 (as noted after the statement of the 
Theorem). But what is the smallest r that ensures the Frobenius norm residual error bounds 
||M-YYtM||F < (l + e)||M-Mfe||F and ||M - QXoptllr < (l + e)||M-Mfc||F are satisfied with 
some fixed probability? To investigate, in Figure 3 we plot the values of r determined empirically 
to be sufficient to obtain (1 + e) Frobenius norm residual errors relative to the optimal rank-/c 
approximation; we fix the failure probability 5=1/2 and vary e. Specifically, the r plotted for 
each k is the smallest number of samples for which ||M — YY^M||f < (1 -I- — M/j||f (or 

||M - QXoptllF < (1 + e)||M - MfcllF) in at least 5 out of 10 trials. 

It is clear that, for fixed k and e, the number of samples r required to form a non-rank- 
restricted approximation to M with (1 + e) relative residual error is smaller than the r required 
to form a rank-restricted approximation with (1 -|- e) relative residual error. Note that for small 
values of k, the r necessary for relative residual error to be achieved is actually smaller than k 
for all three datasets. This is a reflection of the fact that when ki < k2 are small, the ratio 
||M — Mfc2 IIf/IIM — Mfc^ ||f is very close to one. Outside of the initial flat regions, the empirically 
determined value of r seems to grow linearly with k; this matches with the observation of Woolfe 
et al. that taking r = k + 8 suffices to consistently form accurate low-rank approximations using 
the SRFT scheme, which is very similar to the SRHT scheme [45]. We also note that this matches 
with Theorem 4 which predicts that the necessary r grows at most linearly with k with a slope 
like Inn. 

Finally, Theorem 4 does not guarantee that (1 -|- e) spectral norm relative residual errors can 
be achieved. Instead, it provides bounds on the spectral norm residual errors achieved in terms of 
||M — Mfc||2 and ||M — Mj^IIf that are guaranteed to hold when r is sufficiently large. In Figure 4 
we compare the spectral norm residual error guarantees of Theorem 4 to what is achieved in 
practice. To do so, we take the optimistic viewpoint that the constants in Theorem 4 can be 
optimized to unity. Under this view, if more columns than r2 = e~^[-\/k + ^yhl{n/^)]^ hi{k/S) are 
used to construct the SRHT approximations, then the spectral norm residual error is no larger 
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Figure 3: The value of r empirically necessary to ensure that with probability at least 1/2, 
approximations gener ated by the SRHT algorithms satisfy ||M - YY+MHf < (1 +e)||M - MfcUp 
and ||M- QXopillF < (1 + e) ||M - M^Hf (for M = A, B, C). 
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Figure 4: The empirical spectral norm residual errors relative to those of the optimal rank-Zc 
approximants (||M - YYtM||2/||M - Mfc||2 and ||M - QXopt||2/||M - Mfc||2) plotted alongside 
the same ratio for the bound given in Theorem 4, when r = [2[\/fe + Y^ln(2n)]^ ln(2fe)] (for 
M = A,B,C). 

than 

where p is the rank of M, with probability greater than 1 — 6. Our comparison consists of using r2 
samples to construct the SRHT approximations and then comparing the predicted upper bound 
on the spectral norm residual error, 621 to the empirically observed spectral norm residual errors. 
Figure 4 shows, for several values of k, the upper bound 62 and the observed relative spectral norm 
residual errors, with precision parameter e = 1/2 and failure parameter 5 = 1/2. For each value 
of k, the empirical spectral norm residual error plotted is the largest of the errors from among 
10 trials of low-rank approximations. Note from Figure 4 that with this choice of r, the spectral 
norm residual errors of the rank-restricted and non-rank-restricted SRHT approximations are 
essentially the same. 

Judging from Figures 3 and 4, even when we assume the constants present can be optimized 
away, the bounds given in Theorem 4 are pessimistic: it seems that in fact approximations with 
Frobenius norm residual error within (1 + e) of the error of the optimal rank-/c approximation 
can be obtained with r linear in k, and the spectral norm residual errors are smaller than the 
supplied upper bounds. Thus there is still room for improvement in our understanding of the 
SRHT low-rank approximation algorithm, but as explained in Section 2.1, Theorem 4, especially 
the spectral norm bounds, represents a significant improvement over prior efforts. 

To bring perspective to this discussion, consider that even if one limits consideration to deter- 
ministic algorithms, the known error bounds for the Gu-Eisenstat rank-revealing QR — a popular 
and widely used algorithm for low-rank approximation — are quite pessimistic and do not reflect 
the excellent accuracy that is seen in practice [22]. Regardless, we do not advocate using these 
approximation schemes for applications in which highly accurate low-rank approximations are 
needed. Rather, Theorem 4 and our numerical experiments suggest that they are appropriate in 
situations where one is willing to trade some accuracy for a gain in computational efficiency. 
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